End-to-End Lyrics Recognition with Self-supervised Learning
Lyrics recognition is an important task in music processing. Although
traditional approaches such as the hybrid HMM-TDNN model achieve good
performance, studies applying end-to-end models and self-supervised learning
(SSL) are limited. In this paper, we first establish an end-to-end baseline for
lyrics recognition and then explore the performance of SSL models on the lyrics
recognition task. We evaluate a variety of upstream SSL models with different
training methods (masked reconstruction, masked prediction, autoregressive
reconstruction, and contrastive learning). Our end-to-end self-supervised
models, evaluated on the DAMP music dataset, outperform the previous
state-of-the-art (SOTA) system by 5.23% on the dev set and 2.4% on the test
set, even without a language model trained on a large corpus. Moreover, we
investigate the effect of background music on the performance of
self-supervised learning models and conclude that the SSL models cannot extract
features efficiently in the presence of background music. Finally, we study the
out-of-domain generalization ability of the SSL features considering that those
models were not trained on music datasets.
Comment: 4 pages, 2 figures, 3 tables
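As a rough illustration of the masked-reconstruction objective named above (not any specific upstream model), the sketch below masks contiguous spans of a synthetic feature sequence and scores a stand-in "reconstruction" only on the masked frames; all shapes, span sizes, and the neighbor-averaging stand-in are invented for the example:

```python
import numpy as np

def masked_reconstruction_loss(features, reconstruction, mask):
    """L1 loss computed only over masked frames, as in
    masked-reconstruction pretraining objectives."""
    masked = np.asarray(mask, dtype=bool)
    return float(np.abs(features[masked] - reconstruction[masked]).mean())

rng = np.random.default_rng(0)
T, D = 100, 40                      # frames x feature dims (e.g. log-mel)
feats = rng.normal(size=(T, D))

# Mask ~15% of frames in contiguous spans, as SSL front-ends typically do.
mask = np.zeros(T, dtype=bool)
for start in rng.choice(T - 10, size=3, replace=False):
    mask[start:start + 5] = True

# Stand-in "model": predict each masked frame from distant neighbors.
recon = feats.copy()
idx = np.where(mask)[0]
recon[idx] = 0.5 * (feats[np.clip(idx - 5, 0, T - 1)]
                    + feats[np.clip(idx + 5, 0, T - 1)])

loss = masked_reconstruction_loss(feats, recon, mask)
```

A real upstream model would be trained to drive this loss down; here the point is only that the objective scores nothing outside the masked spans.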
A New Approach to Extract Fetal Electrocardiogram Using Affine Combination of Adaptive Filters
The detection of abnormal fetal heartbeats during pregnancy is important for
monitoring the health of the fetus. While adult ECG analysis has seen major
advances in modern medicine, noninvasive fetal electrocardiography (FECG)
remains a great challenge. In this paper, we introduce a new method based on
affine combinations of adaptive filters to extract FECG signals. The affine
combination of multiple filters is able to precisely fit the reference signal,
and thus obtain more accurate FECGs. We propose a method that combines the
Least Mean Square (LMS) and Recursive Least Squares (RLS) filters, and we find
that the Combined Recursive Least Squares (CRLS) filter achieves the best
performance among all proposed combinations. In addition, we find that CRLS is
more advantageous in extracting FECG from abdominal electrocardiograms (AECG)
with a low signal-to-noise ratio (SNR). Compared with the state-of-the-art
MSF-ANC method, CRLS shows improved performance: sensitivity, accuracy and
F1 score improve by 3.58%, 2.39% and 1.36%, respectively.
Comment: 5 pages, 4 figures, 3 tables
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts
This paper presents a novel algorithm for building an automatic speech
recognition (ASR) model with imperfect training data. Imperfectly transcribed
speech is a prevalent issue in human-annotated speech corpora, which degrades
the performance of ASR models. To address this problem, we propose Bypass
Temporal Classification (BTC) as an extension of the Connectionist Temporal
Classification (CTC) criterion. BTC explicitly encodes the uncertainties
associated with transcripts during training. This is accomplished by enhancing
the flexibility of the training graph, which is implemented as a weighted
finite-state transducer (WFST) composition. The proposed algorithm improves the
robustness and accuracy of ASR systems, particularly when working with
imprecisely transcribed speech corpora. Our implementation will be
open-sourced.
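BTC's training graph is described as a WFST-based extension of CTC; the BTC graph itself is not reproduced here. The sketch below implements only the standard CTC forward (alpha) recursion that BTC builds on, over a toy two-frame, two-symbol example small enough to check by hand:

```python
import numpy as np

def ctc_prob(probs, target, blank=0):
    """Forward (alpha) recursion of standard CTC: total probability of
    all frame-level alignments that collapse to `target`."""
    T, _ = probs.shape
    ext = [blank]                     # extended label sequence with blanks
    for c in target:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition allowed between distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

# Toy example: vocabulary {0: blank, 1: 'a'}, two frames.
probs = np.array([[0.6, 0.4],
                  [0.3, 0.7]])
p = ctc_prob(probs, [1])   # paths a-a, a-blank, blank-a -> 0.82
```

BTC would relax this graph with weighted bypass arcs so that unreliable transcript tokens can be skipped or substituted during training.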
PQLM -- Multilingual Decentralized Portable Quantum Language Model for Privacy Protection
With careful manipulation, malicious agents can reverse engineer private
information encoded in pre-trained language models. Security concerns motivate
the development of quantum pre-training. In this work, we propose a highly
portable quantum language model (PQLM) that can easily transmit information to
downstream tasks on classical machines. The framework consists of a cloud PQLM
built with random Variational Quantum Classifiers (VQC) and local models for
downstream applications. We demonstrate the ad hoc portability of the quantum
model by extracting only the word embeddings and effectively applying them to
downstream tasks on classical machines. Our PQLM exhibits comparable
performance to its classical counterpart on both intrinsic evaluation (loss,
perplexity) and extrinsic evaluation (multilingual sentiment analysis accuracy)
metrics. We also perform ablation studies on the factors affecting PQLM
performance to analyze model stability. Our work establishes a theoretical
foundation for a portable quantum pre-trained language model that could be
trained on private data and made available for public use with privacy
protection guarantees.
Comment: 5 pages, 3 figures, 3 tables
Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex
While significant advancements in artificial intelligence (AI) have catalyzed
progress across various domains, its full potential in understanding visual
perception remains underexplored. We propose an artificial neural network
dubbed VISION, an acronym for "Visual Interface System for Imaging Output of
Neural activity," to mimic the human brain and show how it can foster
neuroscientific inquiries. Using visual and contextual inputs, this multimodal
model predicts the brain's functional magnetic resonance imaging (fMRI) scan
response to natural images. VISION successfully predicts human hemodynamic
responses as fMRI voxel values to visual inputs with an accuracy exceeding
state-of-the-art performance by 45%. We further probe the trained networks to
reveal representational biases in different visual areas, generate
experimentally testable hypotheses, and formulate an interpretable metric to
associate these hypotheses with cortical functions. With both a model and
evaluation metric, the cost and time burdens associated with designing and
implementing functional analysis on the visual cortex could be reduced. Our
work suggests that the evolution of computational models may shed light on our
fundamental understanding of the visual cortex and provide a viable approach
toward reliable brain-machine interfaces.
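VISION itself is a multimodal neural network; as a rough illustration of the general encoding-model setup (predicting per-voxel fMRI responses from stimulus features and scoring held-out correlation), here is a minimal linear ridge-regression sketch on synthetic data. The dimensions, noise level, and features are all invented for the example:

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X'X + alpha*I)^-1 X'Y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ Y)

rng = np.random.default_rng(1)
N, D, V = 200, 20, 50                        # stimuli, feature dims, voxels
X = rng.normal(size=(N, D))                  # stimulus (+context) features
W_true = rng.normal(size=(D, V))
Y = X @ W_true + 0.1 * rng.normal(size=(N, V))   # simulated voxel responses

W = fit_ridge(X[:150], Y[:150], alpha=1.0)   # fit on 150 training stimuli
pred = X[150:] @ W                           # predict the 50 held-out ones
# Per-voxel prediction accuracy: correlation with held-out responses.
r = [np.corrcoef(pred[:, v], Y[150:, v])[0, 1] for v in range(V)]
```

Probing which feature dimensions carry weight for which voxels is the kind of analysis the abstract's interpretable metric formalizes.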
Investigating model performance in language identification: beyond simple error statistics
Language development experts need tools that can automatically identify
languages from fluent, conversational speech, and provide reliable estimates of
usage rates at the level of an individual recording. However, language
identification systems are typically evaluated on metrics such as equal error
rate and balanced accuracy, applied at the level of an entire speech corpus.
These overview metrics do not provide information about model performance at
the level of individual speakers, recordings, or units of speech with different
linguistic characteristics. Overview statistics may therefore mask systematic
errors in model performance for some subsets of the data, and consequently,
have worse performance on data derived from some subsets of human speakers,
creating a kind of algorithmic bias. In the current paper, we investigate how
well a number of language identification systems perform on individual
recordings and speech units with different linguistic properties in the MERLIon
CCS Challenge. The Challenge dataset features accented English-Mandarin
code-switched child-directed speech.
Comment: Accepted to Interspeech 2023, 5 pages, 5 figures
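The gap between corpus-level and per-recording evaluation can be shown with a toy example: a classifier that is perfect on one recording but poor on another still posts a respectable corpus-level balanced accuracy. All data below are invented for illustration:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (the corpus-level 'overview' metric)."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c)
                          for c in classes]))

# Toy corpus: two recordings; labels 0 = English, 1 = Mandarin.
y_true = np.array([0] * 50 + [1] * 50 + [0] * 50 + [1] * 50)
rec_id = np.array([0] * 100 + [1] * 100)

# A model that is perfect on recording 0 but poor on recording 1.
y_pred = y_true.copy()
y_pred[100:150] = 1        # recording 1: its English tokens misidentified

overall = balanced_accuracy(y_true, y_pred)          # 0.75 overall
per_rec = {r: float(np.mean(y_pred[rec_id == r] == y_true[rec_id == r]))
           for r in np.unique(rec_id)}               # {0: 1.0, 1: 0.5}
```

The 0.75 overview figure hides that every error falls on one recording, which is exactly the per-speaker, per-recording breakdown the paper argues for.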